Satisficing in Time-Sensitive Bandit Learning

Authors

  • Daniel Russo
  • Benjamin Van Roy
Abstract

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than learning a near-optimal one. Indeed, popular approaches such as upper-confidence-bound methods and Thompson sampling can fare poorly in such situations. We consider instead learning a satisficing action, which is near-optimal while requiring less information, and propose satisficing Thompson sampling, an algorithm that serves this purpose. We establish a general bound on expected discounted regret and study the application of satisficing Thompson sampling to linear and infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson sampling. We also discuss the relation between the notion of satisficing and the theory of rate distortion, which offers guidance on the selection of satisficing actions.
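
As a rough illustration of the idea in the abstract, the sketch below applies a satisficing variant of Thompson sampling to a Bernoulli bandit. It is not taken from the paper. The uniform Beta(1, 1) priors, the tolerance parameter `epsilon`, the rule of playing the lowest-indexed arm whose sampled mean is within `epsilon` of the best sampled mean, and the names `satisficing_thompson_step` and `run` are all assumptions made for illustration.

```python
import random


def satisficing_thompson_step(successes, failures, epsilon, rng):
    """Choose one arm for the current round (illustrative rule, an assumption).

    Draw a mean for every arm from its Beta posterior, then return the
    lowest-indexed arm whose draw is within epsilon of the largest draw.
    """
    # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    best = max(samples)
    # The arm holding the largest draw always qualifies, so next() never fails.
    return next(arm for arm, value in enumerate(samples) if value >= best - epsilon)


def run(true_means, horizon, epsilon, seed=0):
    """Simulate the rule on a Bernoulli bandit and return pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        arm = satisficing_thompson_step(successes, failures, epsilon, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical arm means chosen only for the demonstration.
    print(run([0.50, 0.52, 0.90], horizon=500, epsilon=0.05))
```

Under these assumptions, `epsilon = 0` reduces the rule to ordinary Thompson sampling, while larger values favor lower-indexed arms that look near-optimal over resolving which arm is exactly best.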

Similar resources

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-o...

Satisficing: A ‘Pretty Good’ Heuristic

One of the best known ideas in the study of bounded rationality is Simon’s satisficing; yet we still lack a standard formalization of the heuristic and its implications. We propose a mathematical model of satisficing which explicitly represents agents’ aspirations and which explores both single-person and multi-player contexts. The model shows that satisficing has a signature performance profile ...

The Blinded Bandit: Learning with Adaptive Feedback

We study an online learning setting where the player is temporarily deprived of feedback each time it switches to a different action. Such a model of adaptive feedback naturally occurs in scenarios where the environment reacts to the player’s actions and requires some time to recover and stabilize after the algorithm switches actions. This motivates a variant of the multi-armed bandit problem, wh...

Large-Scale Bandit Problems and KWIK Learning

We show that parametric multi-armed bandit (MAB) problems with large state and action spaces can be algorithmically reduced to the supervised learning model known as “Knows What It Knows” or KWIK learning. We give matching impossibility results showing that the KWIK-learnability requirement cannot be replaced by weaker supervised learning assumptions. We provide such results in both the standard...

Publication date: 2018